{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Lesson 04 - Pandas part 2\n", "\n", "Welcome to lesson 4! In this lesson we will continue with pandas as our main way of storing data. NumPy will be useful when we have a uniform multidimensional data set, but for now pandas will be our default.\n", "\n", "Again, an exhaustive run-through of pandas is too much for this class - see the book [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) or the [official pandas website](http://pandas.pydata.org/pandas-docs/version/0.17.1/).\n", "\n", "If you are familiar with [R](http://pandas.pydata.org/pandas-docs/version/0.17.1/comparison_with_r.html), [SAS](http://pandas.pydata.org/pandas-docs/version/0.17.1/comparison_with_sas.html), and/or [SQL](http://pandas.pydata.org/pandas-docs/version/0.17.1/comparison_with_sql.html), follow the links for an introduction to pandas written for users of each language.\n", "\n", "Please download today's notebook [here](/pythoncourse/assets/notebooks/r&d/lesson 04.ipynb).\n", "\n", "### Data Import\n", "\n", "Importing data is the essential first step of any analysis. Today we will cover read_csv; before the end of the course we will also cover connecting to your Netezza (and other SQL) databases." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pandas import DataFrame, Series\n", "import pandas as pd\n", "import io\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas has many ways of reading data in and writing data out. See the [data IO page](http://pandas.pydata.org/pandas-docs/stable/io.html) for more details."
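Before diving in, here is a minimal sketch of the full round trip - writing a frame out with to_csv and reading it back with read_csv - done entirely in memory with io.StringIO, so no file is needed:

```python
import io

import pandas as pd

# build a small frame, write it to an in-memory "file", and read it back
df = pd.DataFrame({'A': ['a', 'b'], 'B': [1, 3]})
buf = io.StringIO()
df.to_csv(buf, index=False)   # to_csv is the writing counterpart of read_csv
buf.seek(0)                   # rewind the buffer before re-reading
df2 = pd.read_csv(buf)
print(df2.equals(df))         # True - the round trip is lossless here
```

The same pattern works with a real file path in place of the buffer.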
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "date,A,B,C\n", "20090101,a,1,2\n", "20090102,b,3,4\n", "20090103,c,4,5\n", "\n" ] } ], "source": [ "#using a string as example\n", "#we could refer to file names if we had the file saved\n", "data = '''\n", "date,A,B,C\n", "20090101,a,1,2\n", "20090102,b,3,4\n", "20090103,c,4,5\n", "'''\n", "print(data)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " date A B C\n", "0 20090101 a 1 2\n", "1 20090102 b 3 4\n", "2 20090103 c 4 5" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#by default, the index is arange(nrows)\n", "pd.read_csv(io.StringIO(data))\n", "#recall index objects: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " A B C\n", "date \n", "20090101 a 1 2\n", "20090102 b 3 4\n", "20090103 c 4 5" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#we can specify the index:\n", "pd.read_csv(io.StringIO(data), index_col=0)\n", "#also index_col='date'" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " B C\n", "date A \n", "20090101 a 1 2\n", "20090102 b 3 4\n", "20090103 c 4 5\n", " B C\n", "A \n", "a 1 2\n" ] } ], "source": [ "#we can also use nested indices:\n", "x = pd.read_csv(io.StringIO(data), index_col=['date','A'])\n", "print(x)\n", "print(x.loc[20090101])" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
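As a sketch of how selection works on a nested (hierarchical) index - the CSV string is rebuilt inline so the block stands alone:

```python
import io

import pandas as pd

data = "date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5\n"
x = pd.read_csv(io.StringIO(data), index_col=['date', 'A'])

# .loc with just the outer level returns the matching sub-frame
print(x.loc[20090101])
# a full (outer, inner) tuple plus a column name drills down to a scalar
print(x.loc[(20090102, 'b'), 'B'])   # 3
```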
" ], "text/plain": [ " A B C\n", "date \n", "20090101 a 1 2\n", "20090102 b 3 4\n", "20090103 c 4 5" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#we can specify data type (it will speed things up, or avoid conversion)\n", "pd.read_csv(io.StringIO(data), index_col=['date'], dtype={'A' : str, 'B':np.int32, 'C':np.float64})" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
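To confirm the requested types actually took effect, inspect the .dtypes attribute; a minimal sketch with the same inline data (note that a str column is stored with the object dtype):

```python
import io

import numpy as np
import pandas as pd

data = "date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5\n"
df = pd.read_csv(io.StringIO(data), index_col=['date'],
                 dtype={'A': str, 'B': np.int32, 'C': np.float64})

# each column now carries the dtype we asked for
print(df.dtypes)   # A: object, B: int32, C: float64
```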
" ], "text/plain": [ " foo bar baz\n", "20090101 a 1 2\n", "20090102 b 3 4\n", "20090103 c 4 5" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#We can throw out names and use our own\n", "pd.read_csv(io.StringIO(data), index_col=[0],\n", " dtype={'A' : str, 'B':np.int32, 'C':np.float64},\n", " names=[\"foo\", 'bar', \"baz\"],\n", " header = 0)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " foo baz\n", "0 a 2\n", "1 b 4\n", "2 c 5" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#filter out some unneeded columns:\n", "pd.read_csv(io.StringIO(data),\n", " names=['date', 'foo', 'bar', \"baz\"],\n", " header = 0,\n", " usecols = ['foo', 'baz'])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', name='date', freq=None)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#dates! more to come\n", "dat = pd.read_csv(io.StringIO(data),\n", " parse_dates = True,\n", " index_col = [0]\n", " )\n", "dat.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quick Review\n", "\n", "As mentioned in the previous lecture, we can use all our base and NumPy methods on pandas DataFrames. Here is a quick review taken from the SQL lesson:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
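Once the index is parsed into a DatetimeIndex, date attributes and date-aware operations come for free; a small sketch using the same inline data:

```python
import io

import pandas as pd

data = "date,A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5\n"
dat = pd.read_csv(io.StringIO(data), parse_dates=True, index_col=[0])

# the parsed index exposes year/month/day attributes and date arithmetic
print(dat.index.year.tolist())   # [2009, 2009, 2009]
print(dat.index.min())           # the earliest date
```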
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2\n", "1 10.34 1.66 Male No Sun Dinner 3\n", "2 21.01 3.50 Male No Sun Dinner 3\n", "3 23.68 3.31 Male No Sun Dinner 2\n", "4 24.59 3.61 Female No Sun Dinner 4" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips = pd.read_csv('https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv')\n", "tips.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SQL select:\n", "```\n", "SELECT total_bill, tip, smoker, time\n", "FROM tips\n", "LIMIT 5;\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " total_bill tip smoker time\n", "0 16.99 1.01 No Dinner\n", "1 10.34 1.66 No Dinner\n", "2 21.01 3.50 No Dinner\n", "3 23.68 3.31 No Dinner\n", "4 24.59 3.61 No Dinner" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips[['total_bill', 'tip', 'smoker', 'time']].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SQL where:\n", "```\n", "SELECT *\n", "FROM tips\n", "WHERE time = 'Dinner' AND tip > 5.00\n", "LIMIT 5;\n", "```" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
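One detail worth remembering about the selection above: a list of column names returns a DataFrame, while a single name returns a Series. A sketch, using a tiny hypothetical stand-in for the tips data so the block needs no download:

```python
import pandas as pd

# a tiny stand-in for the real tips data
tips = pd.DataFrame({'total_bill': [16.99, 10.34, 21.01],
                     'tip': [1.01, 1.66, 3.50],
                     'time': ['Dinner', 'Dinner', 'Lunch']})

print(type(tips[['tip']]).__name__)   # DataFrame - note the double brackets
print(type(tips['tip']).__name__)     # Series
```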
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "23 39.42 7.58 Male No Sat Dinner 4\n", "44 30.40 5.60 Male No Sun Dinner 4\n", "47 32.40 6.00 Male No Sun Dinner 4\n", "52 34.81 5.20 Female No Sun Dinner 4\n", "59 48.27 6.73 Male No Sat Dinner 4" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)].head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Groupby\n", "\n", "Grouping and summarising data allows us to carry out the key data analysis steps of [split, apply, combine](https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf). The journal article by Hadley Wickham was one of the first formalisations of the split apply combine paradigm, and we can of course do it in Python.\n", "\n", "* Splitting the data into groups based on some criteria\n", "* Applying a function to each group independently\n", "* Combining the results into a data structure\n", "\n", "Let's continue on with our analysis of the tips data:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tipsgroups = tips.groupby('sex')\n", "tipsgroups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a new data type, the groupby object.\n", "We can access the attribute, groups. 
This is a dict, with each level as its own entry holding the indices of the original data frame:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Female\n", "Male\n" ] } ], "source": [ "for i,j in tipsgroups.groups.items():\n", " print(i)\n", "#print(tipsgroups.groups)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do grouping on any axis, or with a custom function (this example is contrived):" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{0: ['tip', 'sex', 'day', 'time', 'size'], 1: ['total_bill', 'smoker']}" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def myfun(index):\n", " if len(index) >= 5:\n", " return 1\n", " else:\n", " return 0\n", "\n", "group2 = tips.groupby(myfun, axis = 1)\n", "group2.groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use tab completion to see all our methods and attributes:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
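Besides iterating over .groups, we can pull a single group's sub-frame out directly with .get_group; a sketch on a small stand-in for the tips data:

```python
import pandas as pd

# a tiny stand-in for the real tips data
tips = pd.DataFrame({'sex': ['Female', 'Male', 'Male', 'Female'],
                     'tip': [1.01, 1.66, 3.50, 3.61]})
g = tips.groupby('sex')

# .groups maps each label to the row index labels in that group
print(sorted(g.groups))          # ['Female', 'Male']
# .get_group extracts one group's rows as a plain DataFrame
print(g.get_group('Male'))
```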
" ], "text/plain": [ " total_bill tip size\n", "sex \n", "Female 18.056897 2.833448 2.459770\n", "Male 20.744076 3.089618 2.630573" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from matplotlib import pyplot as plt\n", "%matplotlib inline\n", "tipsgroups.mean()\n", "#tipsgroups.boxplot();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can iterate through groups:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Female\n", " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2\n", "4 24.59 3.61 Female No Sun Dinner 4\n", "11 35.26 5.00 Female No Sun Dinner 4\n", "14 14.83 3.02 Female No Sun Dinner 2\n", "16 10.33 1.67 Female No Sun Dinner 3\n", "Male\n", " total_bill tip sex smoker day time size\n", "1 10.34 1.66 Male No Sun Dinner 3\n", "2 21.01 3.50 Male No Sun Dinner 3\n", "3 23.68 3.31 Male No Sun Dinner 2\n", "5 25.29 4.71 Male No Sun Dinner 4\n", "6 8.77 2.00 Male No Sun Dinner 2\n" ] } ], "source": [ "for name, group in tipsgroups:\n", " print(name)\n", " print(group.head(5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To apply, we can use .aggregate:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "sex\n", "Female 2.833448\n", "Male 3.089618\n", "Name: tip, dtype: float64" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tipsgroups.aggregate(np.mean)\n", "#selecting columns:\n", "tipsgroups['tip'].aggregate(np.mean)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
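Aggregation also accepts function names as strings, which here behaves the same as passing the NumPy callables; a sketch on a small stand-in frame:

```python
import pandas as pd

# a tiny stand-in for the real tips data
tips = pd.DataFrame({'sex': ['Female', 'Male', 'Male', 'Female'],
                     'tip': [1.0, 2.0, 4.0, 3.0]})

# 'mean' acts like np.mean: one reduced value per group
means = tips.groupby('sex')['tip'].aggregate('mean')
print(means['Female'], means['Male'])   # 2.0 3.0
```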
" ], "text/plain": [ " total_bill tip size \\\n", " mean sum std mean sum std mean \n", "sex \n", "Female 18.056897 1570.95 8.009209 2.833448 246.51 1.159495 2.459770 \n", "Male 20.744076 3256.82 9.246469 3.089618 485.07 1.489102 2.630573 \n", "\n", " \n", " sum std \n", "sex \n", "Female 214 0.937644 \n", "Male 413 0.955997 " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#.agg is short for aggregate\n", "tipsgroups.agg([np.mean, np.sum, np.std])" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
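Passing a list produces one result column per function, and string names work just as well as the NumPy callables. A sketch on a stand-in frame:

```python
import pandas as pd

# a tiny stand-in for the real tips data
tips = pd.DataFrame({'sex': ['Female', 'Male', 'Male', 'Female'],
                     'tip': [1.0, 2.0, 4.0, 3.0]})

# one row per group, one column per aggregation
res = tips.groupby('sex')['tip'].agg(['mean', 'sum', 'std'])
print(res)
```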
" ], "text/plain": [ " size tip \n", " mean sum\n", "sex \n", "Female 6 2.833448 246.51\n", "Male 6 3.089618 485.07" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#we can also use a dict, to do different things to different columns:\n", "tipsgroups.agg({'tip': [np.mean, np.sum], 'size':lambda x: max(x)})\n", "#see also .apply()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also filter, transform, plot, count, and so on. Take a look in the help for more details!\n", "\n", "### Joins\n", "\n", "We can use a variety of joins in pandas, the most basic being the concat function:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df1 = DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n", " 'B': ['B0', 'B1', 'B2', 'B3'],\n", " 'C': ['C0', 'C1', 'C2', 'C3'],\n", " 'D': ['D0', 'D1', 'D2', 'D3']},\n", " index=[0, 1, 2, 3])\n", "df2 = DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],\n", " 'B': ['B4', 'B5', 'B6', 'B7'],\n", " 'C': ['C4', 'C5', 'C6', 'C7'],\n", " 'D': ['D4', 'D5', 'D6', 'D7']},\n", " index=[4, 5, 6, 7])\n", "#print(df1)\n", "#print(df2)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
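Two of the methods mentioned above deserve a quick sketch: transform returns a result indexed like the original rows, while filter drops whole groups that fail a test (stand-in data again, so the block stands alone):

```python
import pandas as pd

# a tiny stand-in for the real tips data
tips = pd.DataFrame({'sex': ['Female', 'Male', 'Male', 'Female'],
                     'tip': [1.0, 2.0, 4.0, 3.0]})
g = tips.groupby('sex')

# transform: every row is replaced by its group's mean, same length as input
print(g['tip'].transform('mean').tolist())   # [2.0, 3.0, 3.0, 2.0]
# filter: keep only rows belonging to groups whose mean tip exceeds 2.5
print(g.filter(lambda grp: grp['tip'].mean() > 2.5))
```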
" ], "text/plain": [ " A B C D\n", "0 A0 B0 C0 D0\n", "1 A1 B1 C1 D1\n", "2 A2 B2 C2 D2\n", "3 A3 B3 C3 D3\n", "4 A4 B4 C4 D4\n", "5 A5 B5 C5 D5\n", "6 A6 B6 C6 D6\n", "7 A7 B7 C7 D7" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#joins on index\n", "pd.concat([df1, df2])" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
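concat can also label its inputs with keys=, which builds a hierarchical index and keeps track of where each row came from; a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3']}, index=[0, 1])

# keys= adds an outer index level naming each input frame
out = pd.concat([df1, df2], keys=['first', 'second'])
print(out.loc['second'])   # just df2's rows
```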
" ], "text/plain": [ " A B C D A B C D\n", "0 A0 B0 C0 D0 NaN NaN NaN NaN\n", "1 A1 B1 C1 D1 NaN NaN NaN NaN\n", "2 A2 B2 C2 D2 NaN NaN NaN NaN\n", "3 A3 B3 C3 D3 NaN NaN NaN NaN\n", "4 NaN NaN NaN NaN A4 B4 C4 D4\n", "5 NaN NaN NaN NaN A5 B5 C5 D5\n", "6 NaN NaN NaN NaN A6 B6 C6 D6\n", "7 NaN NaN NaN NaN A7 B7 C7 D7" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#joins on index!\n", "pd.concat([df1, df2], axis = 1)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " A B C D A B C D\n", "0 NaN NaN NaN NaN A0 B0 C0 D0\n", "1 NaN NaN NaN NaN A1 B1 C1 D1\n", "2 NaN NaN NaN NaN A2 B2 C2 D2\n", "3 NaN NaN NaN NaN A3 B3 C3 D3\n", "4 A4 B4 C4 D4 NaN NaN NaN NaN\n", "5 A5 B5 C5 D5 NaN NaN NaN NaN\n", "6 A6 B6 C6 D6 NaN NaN NaN NaN\n", "7 A7 B7 C7 D7 NaN NaN NaN NaN" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#we can ignore the original indices:\n", "df2.append(df1, ignore_index=True)\n", "#df2.append(df1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more control, we might want to explicitly use merge!\n", "\n", "We have the standard joins - inner, left, right, and outer (also called a full join):" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
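A sketch of how= in action, plus the indicator= flag, which records each row's origin - handy for checking a join did what you expected:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E'], 'value': [4, 5, 6]})

# an outer (full) join keeps keys from both sides;
# indicator adds a _merge column: 'left_only', 'right_only', or 'both'
out = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(out[['key', '_merge']])
```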
" ], "text/plain": [ " key value_x value_y\n", "0 A -1.259100 NaN\n", "1 B -2.293441 -0.569932\n", "2 C -0.415496 NaN\n", "3 D 0.786847 1.958187\n", "4 D 0.786847 -0.275405\n", "5 E NaN 0.549911" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = DataFrame({'key': ['A', 'B', 'C', 'D'],\n", " 'value': np.random.randn(4)})\n", "df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],\n", " 'value': np.random.randn(4)})\n", "#SQL:\n", "#SELECT *\n", "#FROM df1\n", "#INNER JOIN df2\n", "# ON df1.key = df2.key;\n", "pd.merge(df1, df2, on='key')\n", "#SQL:\n", "#SELECT *\n", "#FROM df1\n", "#LEFT OUTER JOIN df2\n", "# ON df1.key = df2.key;\n", "pd.merge(df1, df2, on='key', how='left')\n", "#SQL:\n", "#SELECT *\n", "#FROM df1\n", "#RIGHT OUTER JOIN df2\n", "# ON df1.key = df2.key;\n", "pd.merge(df1, df2, on='key', how='right')\n", "#SQL:\n", "#SELECT *\n", "#FROM df1\n", "#FULL OUTER JOIN df2\n", "# ON df1.key = df2.key;\n", "pd.merge(df1, df2, on='key', how='outer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have not covered a bunch of stuff: [Pivot tables and reshaping](http://pandas.pydata.org/pandas-docs/stable/reshaping.html), [window functions (which were completely updated yesterday)](http://pandas.pydata.org/pandas-docs/version/0.18.0/whatsnew.html#whatsnew-0180-enhancements-moments) and [time series](http://pandas.pydata.org/pandas-docs/stable/timeseries.html). We will cover these as we continue throughout the course." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example\n", "\n", "From here, we will look at a worked example of data analysis using pandas. In the first lesson, we looked at the example of the tennis fixing scandal, and briefly ran through it. 
Now we have the skills and knowledge to walk through it, and assess the analysis.\n", "\n", "Here's the link to the [original article](http://www.buzzfeed.com/heidiblake/the-tennis-racket) and the [notebook on github](https://github.com/BuzzFeedNews/2016-01-tennis-betting-analysis) " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }